
Conversation

@yhmtsai (Member) commented Jul 21, 2025

This PR adds a logger event for synchronization.
I put the event in the executor_event collection and tagged it as an operation in profiler_hook.
To distinguish the additional synchronization performed by profiler_hook from the synchronization inside the profiled function, I allow the profiler hook to call synchronize_impl directly, so that its calls do not emit the usual synchronize event.
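A minimal sketch of that split; only the synchronize/synchronize_impl names come from this PR, the event-emission helpers are assumed for illustration:

```cpp
// Sketch: the public synchronize() emits the logger event, while the
// profiler hook's own bookkeeping calls synchronize_impl() directly
// and therefore produces no synchronize event of its own.
class Executor {
public:
    void synchronize() const
    {
        begin_operation("synchronize");  // hypothetical event emission
        synchronize_impl();
        end_operation("synchronize");
    }

protected:
    friend class profiler_hook;  // may call synchronize_impl() directly

    virtual void synchronize_impl() const = 0;
    virtual void begin_operation(const char* name) const = 0;
    virtual void end_operation(const char* name) const = 0;
};
```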

As an aside, this PR is not yet enough to distinguish the real copy time from the time spent waiting for previous GPU activity, because the destination of the residual_norm copy is pageable memory unless we are on Grace Hopper or use cudaMallocHost for the result memory; that will be handled in another PR. According to https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-async, the copy might be synchronous with respect to the host.
In practice, the current behavior synchronizes during the cudaMemcpyAsync, so we actually have two synchronizations: the copy and the explicit sync. Unfortunately, we cannot eliminate the explicit sync, because CUDA only states that the copy "might" synchronize.
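The double synchronization looks roughly like this (a sketch with hypothetical names; cudaMemcpyAsync and cudaStreamSynchronize are real CUDA runtime calls):

```cpp
#include <cuda_runtime.h>

// Copying a scalar from device to pageable host memory: the async copy
// "might" already synchronize with the host (sync #1), but since the
// documentation only says "might", the explicit wait (sync #2) must stay.
double copy_norm_to_host(const double* device_norm, cudaStream_t stream)
{
    double host_norm = 0.0;  // pageable stack memory, not cudaMallocHost'd
    cudaMemcpyAsync(&host_norm, device_norm, sizeof(double),
                    cudaMemcpyDeviceToHost, stream);  // possible sync #1
    cudaStreamSynchronize(stream);                    // explicit sync #2
    return host_norm;
}
```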

@yhmtsai requested a review from a team July 21, 2025 08:39
@yhmtsai self-assigned this Jul 21, 2025
@yhmtsai added the 1:ST:ready-for-review label (This PR is ready for review) Jul 21, 2025
@ginkgo-bot added the mod:core (core module), mod:cuda (CUDA module), mod:hip (HIP module), and mod:dpcpp (DPC++ module) labels Jul 21, 2025
@upsj (Member) commented Jul 21, 2025

This looks like it is only useful for profiler_hook loggers that don't distinguish between CPU and GPU timing, i.e. only the summary and nested_summary loggers, right? I'm not sure yet whether that is enough of a reason to add a separate logger event. If we use device timers, we get correct device timings at the high level; if we use host timers, we get correct host timings; and if we want mixed information, a purpose-built tracing tool like NSight Systems or rocProfiler gives more reliable results.

Additionally, this would add noise to the profiler output for executors where synchronize is a no-op.
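For reference, a hedged sketch of the device-timer/host-timer distinction, using raw CUDA events rather than Ginkgo's logger API; launch_work is a hypothetical stand-in for the profiled operation:

```cpp
#include <chrono>
#include <cuda_runtime.h>

void launch_work(cudaStream_t stream);  // hypothetical profiled operation

// Device timing: elapsed time between two events recorded on the stream,
// i.e. the time the GPU actually spent on the work.
float device_time_ms(cudaStream_t stream)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, stream);
    launch_work(stream);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Host timing: CPU wall time, which also includes any time the host
// spends blocked waiting for the device to finish.
double host_time_ms(cudaStream_t stream)
{
    auto begin = std::chrono::steady_clock::now();
    launch_work(stream);
    cudaStreamSynchronize(stream);
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - begin).count();
}
```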

@yhmtsai (Member, Author) commented Jul 21, 2025

I am not a big fan of providing this when we can use the vendor profiler directly, but it should not give misleading information if we do provide it.
On the latter point, we may run benchmarks with two loggers, one for GPU and one for CPU, and then it helps to know that the CPU time is spent in synchronization rather than in the actual function, although that requires a different host allocator from #1897.

@upsj (Member) commented Jul 21, 2025

Let me be a bit more precise: the noise is the biggest issue for me. If the executor doesn't do anything when synchronizing, it also shouldn't show up in the output.

Looking at where we actually call synchronize explicitly in Ginkgo, it all seems related to distributed use cases or copies (there are a few other places, but I think they are unnecessary). For that purpose, it may make sense to spend some more time thinking about asynchronicity and futures, since that is essentially what we need here to avoid the duplicate synchronization. With explicit device-related futures, we can provide a function that waits for multiple streams (and removes duplicate waits if some of them are the same stream), which would then allow us to add asynchronous copy functionality to all executors.

To be clear about terminology: I don't mean using multiple threads, I mean using a single thread to poll multiple futures, as in the sketch below. If we get this right, we can also cleanly extend it to a multi-stream setup, which would allow even more overlap of operations.
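A hedged sketch of that single-threaded polling idea (hypothetical helper; one recorded event stands in for each stream's future, and duplicate streams could share an event):

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <vector>

// Poll several device "futures" from one host thread: repeatedly query
// each event and mark it done once cudaEventQuery reports completion,
// never blocking on any single stream. A real implementation would
// likely add error handling and back off instead of busy-waiting.
void wait_all(const std::vector<cudaEvent_t>& events)
{
    std::vector<bool> done(events.size(), false);
    std::size_t remaining = events.size();
    while (remaining > 0) {
        for (std::size_t i = 0; i < events.size(); ++i) {
            if (!done[i] && cudaEventQuery(events[i]) == cudaSuccess) {
                done[i] = true;
                --remaining;
            }
        }
    }
}
```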

@yhmtsai (Member, Author) commented Jul 21, 2025

Asynchronicity is a different topic.
We call synchronize, so it should appear in the log; if it is noise because it is a no-op, we should avoid the no-op call, not avoid logging it.
If you mean the duplicated wait from the copy itself plus the explicit wait in the Ginkgo copy: we cannot avoid the duplicate synchronization, because the documentation only says the copy might synchronize, so it might not, even when the memory is pageable. Also, we cannot force users to use cudaMallocHost to enable a truly asynchronous copy here.
If you mean the duplicated waits from many copies, that is indeed a slowdown factor when we issue several copies at the same time, and it should be avoided.
However, you still need to wait when you read the value and need it to be the updated one.
